🎯 Objective

1. Choosing variables

Work your way through the textbook lab 6.5.1 Best Subset Selection, and the forward stepwise procedure in 6.5.2, then answer these questions.

library(knitr)
library(tidyverse)
library(ISLR)
library(skimr)
library(leaps)
library(patchwork)
library(rsample)
library(tidymodels)
library(glmnet)
data(Hitters)
# Take a look at the data
skim(Hitters)
# Handle the missing values on Salary,
# by removing them
Hitters <- Hitters %>%
  filter(!is.na(Salary))

# Subset selection
regfit.full <- regsubsets(Salary~., Hitters)
summary(regfit.full)
  1. The regsubsets() function (part of the leaps library) performs best subset selection by identifying the best model that contains a given number of predictors, where best is quantified using RSS. By default it only examines up to 8 variable models. Which variables make the best 8 variable model?
coef(regfit.full, 8)
#  (Intercept)        AtBat         Hits        Walks       CHmRun        CRuns 
#  130.9691577   -2.1731903    7.3582935    6.0037597    1.2339718    0.9651349 
#       CWalks    DivisionW      PutOuts 
#   -0.8323788 -117.9657795    0.2908431
  1. Set the max number of variables to be 19, by adding the argument nvmax=19. Plot the model fit diagnostics for the best model of each size. What would these diagnostics suggest about an appropriate choice of models? Do your results compare with the text book results? Why not?
regfit.full <- regsubsets(Salary~., data=Hitters, nvmax=19)
reg.summary <- summary(regfit.full)
#names(reg.summary)
models <- tibble(nvars=1:19, rsq=reg.summary$rsq, 
                 rss=reg.summary$rss, 
                 adjr2=reg.summary$adjr2, 
                 cp=reg.summary$cp, 
                 bic=reg.summary$bic)
p1 <- ggplot(models, aes(x=nvars, y=rsq)) + geom_line()
p2 <- ggplot(models, aes(x=nvars, y=rss)) + geom_line()
p3 <- ggplot(models, aes(x=nvars, y=adjr2)) + geom_line()
p4 <- ggplot(models, aes(x=nvars, y=cp)) + geom_line()
p5 <- ggplot(models, aes(x=nvars, y=bic)) + geom_line()
p1 + p2 + p3 + p4 + p5

BIC would suggest 6 variables. (It gets worse after 6, and then better at 8, and then worse again.) The others suggest around 10. The textbook suggests 6 variables, so similar results here.

  1. Fit forward stepwise selection. How would the decision about best model change?
regfit.fwd <- regsubsets(Salary~., data=Hitters, nvmax=19, method="forward")
#summary(regfit.fwd)

d <- tibble(nvars=0:19,  
                 rss=regfit.fwd$rss)
ggplot(d, aes(x=nvars, y=rss)) + geom_line()

Full model

coef(regfit.full, 6)
#  (Intercept)        AtBat         Hits        Walks         CRBI    DivisionW 
#   91.5117981   -1.8685892    7.6043976    3.6976468    0.6430169 -122.9515338 
#      PutOuts 
#    0.2643076

Forward selection

coef(regfit.fwd, 6)
#  (Intercept)        AtBat         Hits        Walks         CRBI    DivisionW 
#   91.5117981   -1.8685892    7.6043976    3.6976468    0.6430169 -122.9515338 
#      PutOuts 
#    0.2643076

We are using RSS, because this is returned by the forward stepwise procedure. Look for decreasing values, and when it flattens out. The suggestion is at 6 or 10 variables.

The models with 6 predictors are identical.

2. Training and testing sets with variable selection

  1. Break the data into a 2/3 training and 1/3 test set.
  2. Fit the best subsets. Compute the mean square error for the test set. Which model would it suggest? Is the subset of models similar to produced on the full data set? Do your results compare with the text book results? Why not?
set.seed(1)
split <- initial_split(Hitters, 2/3)
h_tr <- training(split)
h_ts <- testing(split)

regfit.best <- regsubsets(Salary~., data=h_tr, nvmax=19)
test.mat <- model.matrix(Salary~., data=h_ts)
val.errors <- rep(NA,19)
for (i in 1:19) {
   coefi <- coef(regfit.best, id=i)
   pred <- test.mat[,names(coefi)]%*%coefi
   val.errors[i] <- mean((h_ts$Salary-pred)^2)
}
val.errors
#  [1] 106172.08  81664.60  90858.72  81614.97  77020.89  80218.62  76664.07
#  [8]  72035.94  72137.70  70979.12  71767.23  71812.66  72429.40  71887.34
# [15]  71435.67  71834.43  71908.20  71717.66  71769.64
d2 <- tibble(nvars=1:19,  
                 err = val.errors)
ggplot(d2, aes(x=nvars, y=err)) + geom_line()

coef(regfit.best, which.min(val.errors))
# (Intercept)       AtBat        Hits       Walks      CAtBat       CRuns 
# 223.2613709  -2.8895562   8.8700075   6.0558531  -0.1430211   1.5174180 
#        CRBI      CWalks   DivisionW     PutOuts     Assists 
#   0.7689911  -0.9155342 -94.1445405   0.2777674   0.3240994
coef(regfit.full, 10)
#  (Intercept)        AtBat         Hits        Walks       CAtBat        CRuns 
#  162.5354420   -2.1686501    6.9180175    5.7732246   -0.1300798    1.4082490 
#         CRBI       CWalks    DivisionW      PutOuts      Assists 
#    0.7743122   -0.8308264 -112.3800575    0.2973726    0.2831680

The model selection of best subsets provides different results, 10 predictors, which is likely due to using a subset of data for training the model. Note that we used the test error as the measure for choosing models, and the different metric could produce different results. (Interestingly, if a different random seed is used, you might get a different best model. In this case, you should select the variables that consistently get used in the best model from different seeds.)

The selected variables are the same as the 10 variable model fitted to the full set. The coefficients in the fitted model differ a little.

3. Cross-validation with variable selection

It is said that 10-fold cross-validation is a reasonable choice for dividing the data. What size data sets would this create for this data? Argue whether this is good or bad.

There isn’t a lot of data. With 10-fold cross-validation only about 20 cases are kept out each time, which leads to substantial variability between predictions from each set. I would suggest using 5-fold. (Note from running CV: When I run this it also produces results with considerable variability. The choice of number of variables is consistent for most \(k\), even though the variability in error is substantial.)

4. Regularisation for variable selection

Here we will use lasso to fit a regularised regression model and compare the results with the best subset model.

  1. Using your results from questions 1-3, fit the best least squares model, to your training set. Write down the mean square error and estimates for the final model. We’ll use these to compare with the lasso fit.
min(val.errors)
# [1] 70979.12
coef(regfit.best, which.min(val.errors))
# (Intercept)       AtBat        Hits       Walks      CAtBat       CRuns 
# 223.2613709  -2.8895562   8.8700075   6.0558531  -0.1430211   1.5174180 
#        CRBI      CWalks   DivisionW     PutOuts     Assists 
#   0.7689911  -0.9155342 -94.1445405   0.2777674   0.3240994
  1. Fit the lasso to a range of \(\lambda\) values. Plot the standardised coefficients against \(\lambda\). What does this suggest about the predictors?
grid <- 10^seq(10,-2,length=100)
x <- model.matrix(Salary~., h_tr)[,-1]
y <- h_tr$Salary

lasso.mod <- glmnet(x, y, alpha=1, lambda=grid)
# Need a coefficient matrix of 100(nlambda)x19(p)
# as.matrix function converts sparse format into this
coeffs <- as.matrix(lasso.mod$beta) 
coeffs <- cbind(var=rownames(coeffs), coeffs)
cv <- coeffs %>% as_tibble() %>%
  gather(nval, coeffs, -var) %>%
  mutate(coeffs=as.numeric(coeffs)) %>%
  mutate(lambda=rep(lasso.mod$lambda, rep(19, 100)))
  
p <- ggplot(cv, aes(x=lambda, y=coeffs, group=var, label=var)) + geom_line() +
  scale_x_log10(limits=c(-1, 100))
plotly::ggplotly(p)
# This is how the sample code plots #plot(lasso.mod, xvar="lambda", xlim=c(-1, 5))

As seen from the few lines that are not near zero, there are just a few predictors that are important for predicting salary.

  1. Now use cross-validation to choose the best \(\lambda\).
# Do cross-validation, using glmnet's function
set.seed(1)
cv.out <- cv.glmnet(x, y, alpha=1)
#cv.df <- tibble(lambda=cv.out$lambda, mse=cv.out$cvm,
#                mse_l=cv.out$cvlo, mse_h=cv.out$cvup)
#ggplot(cv.df, aes(x=lambda, y=mse)) + geom_point() +
#  scale_x_log10() +
#  geom_errorbar(aes(ymin=mse_l, ymax=mse_h))
# This is how the sample code plots
plot(cv.out)

  1. Fit the final model using the best \(\lambda\). What are the estimated coefficients? What predictors contribute to the model?
# Fit a single model to the best lambda, predict the test set, and compute mse
bestlam <- cv.out$lambda.min
bestlam
# [1] 3.72917
x_ts <- model.matrix(Salary~., h_ts)[,-1]
y_ts <- h_ts$Salary
lasso.pred <- predict(lasso.mod, s=bestlam, newx=x_ts)
y.test <- y_ts
mean((lasso.pred-y.test)^2)
# [1] 67452.34

# Fit the model
# alpha=1 means lasso
# Its strange that it still needs the grid of lambdas to fit
# but it seems necessary for the optimisation.
# Then the predictions for best lambda made with predict fn
out <- glmnet(x, y, alpha=1, lambda=grid)
lasso.coef <- predict(out, type="coefficients", s=bestlam)[1:20,]
lasso.coef
#  (Intercept)        AtBat         Hits        HmRun         Runs          RBI 
#  190.7549874   -2.1348701    7.1951191    0.0000000    0.0000000    0.3933491 
#        Walks        Years       CAtBat        CHits       CHmRun        CRuns 
#    4.7012516  -10.4653138    0.0000000    0.0000000    0.1738710    0.6559825 
#         CRBI       CWalks      LeagueN    DivisionW      PutOuts      Assists 
#    0.4173751   -0.5238019    0.0000000 -101.2644847    0.2435969    0.1431246 
#       Errors   NewLeagueN 
#   -1.2064392    0.0000000
lasso.coef[lasso.coef!=0]
#  (Intercept)        AtBat         Hits          RBI        Walks        Years 
#  190.7549874   -2.1348701    7.1951191    0.3933491    4.7012516  -10.4653138 
#       CHmRun        CRuns         CRBI       CWalks    DivisionW      PutOuts 
#    0.1738710    0.6559825    0.4173751   -0.5238019 -101.2644847    0.2435969 
#      Assists       Errors 
#    0.1431246   -1.2064392

With the seed provided there are 13 non-zero coefficients, and an MSE of 67287.59.

  1. Does the best lasso model beat the best least squares model (best subsets)?

The best subsets performs a little better than lasso. It has lower MSE. The lasso has fewer variables, though, and thus is a little easier to interpret.

5. Making sense of it all

Only one variable is very important for the model, which is it? (It occurs in every list of variables returned as important, and has the highest coefficient in the lasso model.) Several more variables frequently appear as important, which ones are these? Several others, appear occasionally in some models but always have very small coefficients. Can you name one of these? What does this tell you about the relationship between Salary and all the predictors?

DivisionW is by far the most important variable. It is always selected, and always has a large coefficient.

AtBat, Hits, Walks persistently appear with relatively small coefficients, and appear in the lasso.

PutOuts, CBRI and Assists appear regularly with really small coefficients.

In the lasso model, it is interesting that the variable NewLeagueN appears to be important, but it is the variable who’s coefficient is reduced to 0 quickly. It also never shows up in any of the subset selection models. Also, years makes an appearance in the lasso model, and is included as a non-zero coefficient in the final model, which differs from all the subset selection models.

There is only one major variable useful for predicting salary, which is DivisionW. This variable alone provides most of the predictive power. Small gains are obtained by adding additional variables.